GDC 




Building a Low-Fragmentation 
Memory System for 64-bit Games 


Aaron MacDougall 

Senior Systems Programmer - SCE London Studio 




GAME DEVELOPERS CONFERENCE March 14-18, 2016 • Expo: March 16-18, 2016 #GDC16 






Background 




• Old memory system ported from PlayStation®3

• Fixed-size memory pools

• Emulated VRAM

Problems

• Wasted a lot of memory 

• Every pool sized for worst case 

• Overhead with small allocations 

• Suffered from fragmentation 

• Texture streaming impractical 

Memory Fragmentation

• Heap fragmented in small non-contiguous blocks 

• Allocations can fail despite enough memory 

• Caused by mixed allocation lifetimes 

Design Goals

• Low fragmentation 

• High utilisation 

• Simple configuration 

• Support PlayStation®4 OS and PC 

• Support efficient texture streaming 

• Comprehensive debugging support 

Virtual Memory

• Process uses virtual addresses 

• Virtual addresses mapped to 
physical addresses 

• CPU looks up physical address 

• Requires OS and hardware support 

Benefits of Virtual Memory


• Reduced memory fragmentation 

• Fragmentation is address fragmentation 

• We use virtual addresses 

• Virtual address space is larger than physical 

• Contiguous virtual memory not contiguous in 
physical memory 

Virtual Address Space

944GB 


Physical Memory 

8GB 

Memory Pages

• Mapped in pages 

• x64 supports: 

• 4kB and 2MB pages 

• PlayStation®4 OS uses: 

• 16kB (4x4kB) and 2MB

• GPU has more sizes 

Page Sizes


• 2MB pages fastest 

• 16kB pages waste less memory

• We use 64kB (4x16kB pages)

• Smallest optimal size for PlayStation®4 GPU 

• Also use 16kB for special cases 
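
The page-size trade-off can be made concrete with a little arithmetic. The following is a minimal sketch, not code from the talk: `RoundUpToPage` and `Wasted` are hypothetical helpers that show how many bytes a single allocation loses when it is rounded up to each of the page sizes mentioned above.

```cpp
#include <cassert>
#include <cstddef>

// Page sizes mentioned in the slides.
constexpr std::size_t kPage16k = 16 * 1024;
constexpr std::size_t kPage64k = 64 * 1024;
constexpr std::size_t kPage2M  = 2 * 1024 * 1024;

// Hypothetical helper: round size up to a multiple of pageSize.
// pageSize must be a power of two.
inline std::size_t RoundUpToPage(std::size_t size, std::size_t pageSize)
{
    assert(pageSize != 0 && (pageSize & (pageSize - 1)) == 0);
    return (size + pageSize - 1) & ~(pageSize - 1);
}

// Bytes lost to rounding for one allocation of the given size.
inline std::size_t Wasted(std::size_t size, std::size_t pageSize)
{
    return RoundUpToPage(size, pageSize) - size;
}
```

For a 100,000 byte allocation this gives 14,688 wasted bytes with 16kB pages, 31,072 with 64kB pages, and 1,997,152 with 2MB pages, which is why the larger page sizes only pay off for big allocations.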

Onion Bus & Garlic Bus

• CPU & GPU can access both 

• But at different bandwidths 


Onion = fast CPU access 
Garlic = fast GPU access 

Flexible & Direct Memory on PlayStation®4

• Same virtual address space 

• Flexible 

• 512MB pre-allocated by OS 

• 16kB pages mapped to Onion (CPU bus) 

• Direct 

• 16kB or 2MB pages 

• Must be allocated and mapped to Onion or Garlic (GPU bus) 

• Both emulated on PC using 64kB pages 

Our Memory System

• Splits up the entire virtual address space 

• Physical memory mapped on demand 

• Allocator modules manage their own space 

• Each module specialised 

• Allocator objects are the interface to the system 

Allocator

class Allocator
{
public:
    virtual void* Allocate(size_t size, size_t align) = 0;
    virtual void Deallocate(void* pMemory) = 0;
    virtual size_t GetSize(void* pMemory) { return 0; }

    const char* GetName(void) const;

};

Example - GeneralAllocator

void* GeneralAllocator::Allocate(size_t size, size_t align)
{
    if (SmallAllocator::Belongs(size, align))
        return SmallAllocator::Allocate(size, align);
    else if (m_mediumAllocator.Belongs(size, align))
        return m_mediumAllocator.Allocate(size, align);
    else if (LargeAllocator::Belongs(size, align))
        return LargeAllocator::Allocate(size, m_mappingFlags);
    else if (GiantAllocator::Belongs(size, align))
        return GiantAllocator::Allocate(size, m_mappingFlags);

    return nullptr;
}

Our Virtual Address Space


"Large" 160GB 


"Giant" 256GB 

"Medium" 8GB 



944GB 


Mem Tracing 1GB 


Bookkeeping 

Physical Memory on PlayStation®4

• Flexible memory already allocated 

• Direct memory split into 64kB pages 

• Allocated and deallocated on demand 

• Memory bus set when allocated 

• Two free lists containing unused pages 

• Onion 

• Garlic 

Small Allocation Module

• Majority of allocations are <=64 bytes 

• ~250,000 allocations - ~25MB 

• Pack together to prevent fragmentation 

• 16kB pages of same-sized allocations 

• No headers 

16kB Virtual Pages


Page Free List 
8 byte Free List 
16 byte Free List 
24 byte Free List 
32 byte Free List 
40 byte Free List 
48 byte Free List 
etc. 


16kB Page (8 byte entries) 


16kB Page (8 byte entries) 


16kB Page (16 byte entries) 
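
The scheme in the diagram (one free list per size class, each 16kB page holding entries of a single size, no headers) can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: `SmallAllocatorSketch` is a hypothetical name, a `std::vector` stands in for the reserved virtual range, the 64 byte upper limit is assumed, and the page free list that returns empty pages is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

constexpr std::size_t kSmallPageSize = 16 * 1024;  // 16kB pages, as in the slides
constexpr std::size_t kSmallMaxSize  = 64;         // assumed upper limit for "small"
constexpr std::size_t kSmallClasses  = kSmallMaxSize / 8;

class SmallAllocatorSketch
{
public:
    explicit SmallAllocatorSketch(std::size_t pageCount)
        : m_arena(pageCount * kSmallPageSize), m_pageClass(pageCount, 0) {}

    void* Allocate(std::size_t size)
    {
        assert(size > 0 && size <= kSmallMaxSize);
        const std::size_t cls = (size + 7) / 8 - 1;  // 1..8 -> 0, 9..16 -> 1, ...
        if (m_free[cls] == nullptr && !CarvePage(cls))
            return nullptr;                          // out of pages
        FreeNode* node = m_free[cls];
        m_free[cls] = node->next;
        return node;
    }

    void Deallocate(void* p)
    {
        // No header: the page the pointer falls in tells us the entry size.
        const std::size_t page =
            static_cast<std::size_t>(static_cast<unsigned char*>(p) - m_arena.data()) / kSmallPageSize;
        const std::size_t cls = m_pageClass[page];
        m_free[cls] = new (p) FreeNode{m_free[cls]};
    }

private:
    struct FreeNode { FreeNode* next; };

    // Take the next unused page and thread all of its entries onto the free list.
    bool CarvePage(std::size_t cls)
    {
        if (m_nextPage == m_pageClass.size())
            return false;
        unsigned char* page = m_arena.data() + m_nextPage * kSmallPageSize;
        m_pageClass[m_nextPage++] = cls;
        const std::size_t entry = (cls + 1) * 8;
        for (std::size_t off = 0; off + entry <= kSmallPageSize; off += entry)
            m_free[cls] = new (page + off) FreeNode{m_free[cls]};
        return true;
    }

    std::vector<unsigned char> m_arena;       // stands in for the virtual range
    std::vector<std::size_t>   m_pageClass;   // size class of each page
    std::size_t                m_nextPage = 0;
    FreeNode*                  m_free[kSmallClasses] = {};
};
```

Because free entries store the next pointer inside themselves, an allocated entry carries no bookkeeping at all, which matches the "No headers" bullet.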

Small Allocation Module Pros & Cons


+ Tiny implementation 
+ Very low wastage 
+ Makes use of flexible memory 
+ Fast 

- Difficult to detect memory stomps 

Large Allocation Module


• Reserves huge virtual address space (160GB)

• Each table divided into equal-sized slots

• Maps and unmaps 64kB pages on demand

• Guarantees contiguous memory
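
A table of equal-sized slots makes the address arithmetic trivial, and it is why no per-allocation header is needed. The sketch below is an assumption-laden illustration: `SlotTable` is a hypothetical name, addresses are plain integers, and the actual kernel calls that map and unmap physical pages are replaced by a per-slot page counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kLargePageSize = 64 * 1024;  // 64kB pages, as in the slides

class SlotTable
{
public:
    SlotTable(std::uint64_t base, std::uint64_t slotSize, std::size_t slotCount)
        : m_base(base), m_slotSize(slotSize), m_mappedPages(slotCount, 0) {}

    // Returns the virtual address of a free slot, or 0 on failure.
    std::uint64_t Allocate(std::uint64_t size)
    {
        if (size == 0 || size > m_slotSize)
            return 0;
        for (std::size_t i = 0; i < m_mappedPages.size(); ++i)
        {
            if (m_mappedPages[i] == 0)
            {
                // A real implementation would map physical pages here.
                m_mappedPages[i] = (size + kLargePageSize - 1) / kLargePageSize;
                return m_base + i * m_slotSize;
            }
        }
        return 0;
    }

    void Deallocate(std::uint64_t address)
    {
        // The slot index follows directly from the address.
        const std::size_t i = static_cast<std::size_t>((address - m_base) / m_slotSize);
        assert(i < m_mappedPages.size() && m_mappedPages[i] != 0);
        m_mappedPages[i] = 0;  // a real implementation would unmap the pages
    }

    std::uint64_t MappedBytes() const
    {
        std::uint64_t pages = 0;
        for (std::uint64_t p : m_mappedPages)
            pages += p;
        return pages * kLargePageSize;
    }

private:
    std::uint64_t              m_base;
    std::uint64_t              m_slotSize;
    std::vector<std::uint64_t> m_mappedPages;  // 0 means the slot is free
};
```

Only the pages an allocation actually needs consume physical memory; the rest of each slot costs nothing but virtual address space.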

Texture Streaming

• Reserve large allocation slot 

• Rounded up to nearest power of two

• Load max of smallest mip and 64kB 

• Map and unmap pages on demand 

• No need to copy or defrag 
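
The arithmetic behind these bullets can be sketched as below. All function names are hypothetical, and the reading of "max of smallest mip and 64kB" as "map at least one 64kB page up front" is an assumption; the slides do not spell out the mip layout inside the slot.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kStreamPageSize = 64 * 1024;  // 64kB pages, as in the slides

// Smallest power of two >= v (v must be non-zero).
inline std::uint64_t NextPow2(std::uint64_t v)
{
    assert(v != 0);
    std::uint64_t p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

// Virtual slot reserved for the fully resident texture.
inline std::uint64_t SlotSizeForTexture(std::uint64_t fullTextureBytes)
{
    return NextPow2(fullTextureBytes);
}

// Bytes mapped up front: the larger of the smallest mip and one page,
// rounded up to whole pages.
inline std::uint64_t InitialMappedBytes(std::uint64_t smallestMipBytes)
{
    const std::uint64_t wanted =
        smallestMipBytes > kStreamPageSize ? smallestMipBytes : kStreamPageSize;
    return (wanted + kStreamPageSize - 1) / kStreamPageSize * kStreamPageSize;
}

// Pages to map (positive) or unmap (negative) so that residentBytes fit.
inline std::int64_t PageDelta(std::uint64_t mappedBytes, std::uint64_t residentBytes)
{
    const std::int64_t have = static_cast<std::int64_t>(mappedBytes / kStreamPageSize);
    const std::int64_t want =
        static_cast<std::int64_t>((residentBytes + kStreamPageSize - 1) / kStreamPageSize);
    return want - have;
}
```

Since the slot's virtual address never changes, streaming a higher mip in or out only changes which pages are mapped, so nothing has to be copied or defragmented.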

Large Allocation Module Pros & Cons


+ No headers 

+ Simple implementation (~200 lines of code) 

+ No fragmentation 

- Size rounded up to page size 

- Mapping and unmapping kernel calls relatively slow 




Medium Allocation Modules 



• Medium 

• Headerless 


System Pool 
Render Pool 
Physics Pool 

Medium Allocation Module

• All other sizes go here 

• Non-contiguous virtual pages 

• Grows and shrinks 

• Traditional doubly linked list with headers 

• Unsuitable for Garlic memory 

• Headers stored with data 

• Pow2 free lists 
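
Power-of-two free lists are usually indexed by the logarithm of the block size. The slides do not say which convention the medium module uses, so the following is only one common arrangement, with hypothetical function names: a freed block goes into the list for floor(log2(size)), and a request starts searching at ceil(log2(size)), where every block is guaranteed to be big enough.

```cpp
#include <cassert>
#include <cstddef>

// List a free block of the given size is stored in: floor(log2(size)).
inline unsigned FreeListIndex(std::size_t size)
{
    assert(size > 0);
    unsigned index = 0;
    while (size >>= 1)
        ++index;
    return index;
}

// First list guaranteed to satisfy a request: ceil(log2(size)).
inline unsigned SearchStartIndex(std::size_t size)
{
    assert(size > 0);
    const unsigned floorIndex = FreeListIndex(size);
    const bool isPow2 = (size & (size - 1)) == 0;
    return isPow2 ? floorIndex : floorIndex + 1;
}
```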

Headerless Allocation Module

• Used for GPU allocations 

• Small to medium allocations 

• Hash table lookup 
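
The idea of keeping bookkeeping out of the allocated memory can be sketched with a standard hash map keyed by address. `HeaderlessTableSketch` is a hypothetical name and `std::unordered_map` is a stand-in; the talk does not describe the hash table it actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Bookkeeping lives in ordinary CPU memory, so nothing is written
// into the GPU-facing block itself.
class HeaderlessTableSketch
{
public:
    void Record(std::uint64_t address, std::size_t size)
    {
        assert(size != 0);
        m_sizes[address] = size;
    }

    // Returns 0 if the address is not a known allocation.
    std::size_t GetSize(std::uint64_t address) const
    {
        const auto it = m_sizes.find(address);
        return it == m_sizes.end() ? 0 : it->second;
    }

    // Returns true if the address was a live allocation.
    bool Release(std::uint64_t address)
    {
        return m_sizes.erase(address) == 1;
    }

private:
    std::unordered_map<std::uint64_t, std::size_t> m_sizes;
};
```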

Allocator Types

• GeneralAllocator 

• VramAllocator 

• MappedAllocator 

• GpuScratchAllocator 

• FrameAllocator 


class Allocator
{
public:
    virtual void* Allocate(size_t size, size_t align) = 0;
    virtual void Deallocate(void* pMemory) = 0;
    virtual size_t GetSize(void* pMem);
    const char* GetName(void) const;
};

MM_NEW(pAllocator) MyType();

GPU Scratch Allocator

• Used by renderer for per-frame allocations

• Double buffered 

• No need to deallocate 

• Protected with atomics 
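
A double-buffered bump allocator protected by a single atomic counter could look like the sketch below. `GpuScratchSketch` is a hypothetical name and the details are assumptions; in particular, reserving `size + align - 1` bytes for every request is one simple way to make alignment safe under concurrency, and it illustrates the "worst-case alignment wastes space" trade-off.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

class GpuScratchSketch
{
public:
    // memory must hold 2 * bytesPerFrame bytes (one half per frame in flight).
    GpuScratchSketch(unsigned char* memory, std::size_t bytesPerFrame)
        : m_memory(memory), m_bytesPerFrame(bytesPerFrame), m_frame(0), m_offset(0) {}

    void* Allocate(std::size_t size, std::size_t align)
    {
        assert(align != 0 && (align & (align - 1)) == 0);
        // Reserve worst-case padding so the aligned block always fits.
        const std::size_t padded = size + align - 1;
        const std::size_t offset = m_offset.fetch_add(padded, std::memory_order_relaxed);
        if (offset + padded > m_bytesPerFrame)
            return nullptr;  // fixed size: the frame's half is exhausted
        const std::uintptr_t start =
            reinterpret_cast<std::uintptr_t>(m_memory + m_frame * m_bytesPerFrame + offset);
        const std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
        return reinterpret_cast<void*>((start + mask) & ~mask);
    }

    // Call once per frame while no thread is allocating. Everything
    // allocated two frames ago is implicitly released.
    void Flip()
    {
        m_frame ^= 1;
        m_offset.store(0, std::memory_order_relaxed);
    }

private:
    unsigned char*           m_memory;
    std::size_t              m_bytesPerFrame;
    std::size_t              m_frame;
    std::atomic<std::size_t> m_offset;
};
```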

GPU Scratch Allocator - Pros & Cons


+ No headers or bookkeeping 
+ No fragmentation 
+ Fast! 

- Fixed size 

- Worst-case alignment wastes space 

Frame Allocator

• Frames pushed and popped 

• No need to free memory 

• Unique to each thread 

• Useful for temp work buffers 


#include <ls_common/memory/ScratchMem.h>

struct Elem
{
};

void ProcessElements(size_t numElements)
{
    ls::ScratchMem frame;

    Elem* pElements =
        (Elem*)MM_ALLOC_P(
            &frame,
            sizeof(Elem) * numElements
        );

}
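
How a frame object like the one in the example above might work internally can be sketched with a per-thread stack and an RAII scope. This is an illustration only: `ScratchFrameSketch`, its 64kB capacity and its interface are assumptions, not the real `ls::ScratchMem`.

```cpp
#include <cassert>
#include <cstddef>

class ScratchFrameSketch
{
public:
    // Constructing a frame "pushes": remember where the stack top was.
    ScratchFrameSketch() : m_savedTop(Stack().top) {}

    // Destroying it "pops": everything allocated since is released at once.
    ~ScratchFrameSketch() { Stack().top = m_savedTop; }

    ScratchFrameSketch(const ScratchFrameSketch&) = delete;
    ScratchFrameSketch& operator=(const ScratchFrameSketch&) = delete;

    void* Allocate(std::size_t size, std::size_t align = alignof(std::max_align_t))
    {
        assert(align != 0 && (align & (align - 1)) == 0);
        ThreadStack& s = Stack();
        const std::size_t start = (s.top + align - 1) & ~(align - 1);
        if (start + size > sizeof(s.bytes))
            return nullptr;
        s.top = start + size;
        return s.bytes + start;
    }

    static std::size_t BytesInUse() { return Stack().top; }

private:
    struct ThreadStack
    {
        alignas(std::max_align_t) unsigned char bytes[64 * 1024];
        std::size_t top = 0;
    };

    // One stack per thread, so no synchronisation is needed.
    static ThreadStack& Stack()
    {
        thread_local ThreadStack s;
        return s;
    }

    std::size_t m_savedTop;
};
```

A pointer obtained from a frame becomes dangling as soon as that frame goes out of scope, which is the hazard the "careful passing pointers around" warning refers to.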

Frame Allocator Pros & Cons


+ No headers or bookkeeping 
+ No fragmentation 
+ No synchronisation 
+ Fast! 

- Careful passing pointers around! 

Thread Safety

• Mutexes at lowest level 

• Allocator instances not protected 

Performance

• Performance not the focus 

• Still important 

• Mapping/unmapping slow 

• No noticeable difference 

• Don't allocate much during game 

• File loading is bottleneck 

Clear Values

• memset to byte value 

• Keep it memorable 

• 0xFA - Flexible memory allocated 

• 0xFF - Flexible memory free 

• 0xDA - Direct memory allocated 

• 0xDF - Direct memory free

• 0xA1 - Memory allocated

• 0xDE - Memory deallocated 
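
Applying these values is a one-line `memset` at each state change. In the sketch below the byte values come from the slide, while the enum and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Byte values from the slide; the identifiers are made up for this sketch.
enum ClearValue : unsigned char
{
    kFlexibleAllocated = 0xFA,
    kFlexibleFree      = 0xFF,
    kDirectAllocated   = 0xDA,
    kDirectFree        = 0xDF,
    kMemoryAllocated   = 0xA1,
    kMemoryDeallocated = 0xDE
};

inline void FillOnAllocate(void* p, std::size_t size)
{
    assert(p != nullptr);
    std::memset(p, kMemoryAllocated, size);
}

inline void FillOnDeallocate(void* p, std::size_t size)
{
    assert(p != nullptr);
    std::memset(p, kMemoryDeallocated, size);
}

// Debug helper: true if every byte still holds the "deallocated" pattern,
// i.e. nothing has written to the block since it was freed.
inline bool LooksDeallocated(const void* p, std::size_t size)
{
    const unsigned char* bytes = static_cast<const unsigned char*>(p);
    for (std::size_t i = 0; i < size; ++i)
        if (bytes[i] != kMemoryDeallocated)
            return false;
    return true;
}
```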

Statistics

• Track everything possible 

• Live graphs available 

• Recorded by automated tests 

Tracing

• ls::MemoryTracing::Lookup(0x000000D01F600000)

• Watch window function call 

• Works for addresses in the middle of a block 


[Screenshot: debugger watch window evaluating ls::MemoryTracing::Lookup(...). The returned tracing record shows the fields pAddress (0x000000d01f600000), pNext, size (1048576), time (1445713746), pAllocator (the "Vram" allocator, with its statistics), pFile ("lsx_render.cpp"), line (1102), deallocated (false), label ("") and pStackFrames. The captured call stack starts at ls::AllocateMemory and runs through the lsx_render and game initialisation functions down to main.]
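
Looking up an address that points into the middle of a block is a classic ordered-map query. The sketch below assumes a `std::map` keyed by block start address; `TraceRecord` and `TraceTableSketch` are hypothetical and carry far fewer fields than the real tracing record shown in the watch window.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct TraceRecord
{
    std::uint64_t address;  // start of the block
    std::uint64_t size;
    const char*   file;
    int           line;
};

class TraceTableSketch
{
public:
    void Add(const TraceRecord& record)
    {
        m_records[record.address] = record;
    }

    // Finds the block containing address, even if address is not its start.
    const TraceRecord* Lookup(std::uint64_t address) const
    {
        // First block starting strictly after address...
        auto it = m_records.upper_bound(address);
        if (it == m_records.begin())
            return nullptr;
        // ...so the previous one is the last block starting at or before it.
        --it;
        const TraceRecord& record = it->second;
        return (address - record.address < record.size) ? &record : nullptr;
    }

private:
    std::map<std::uint64_t, TraceRecord> m_records;
};
```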

Tracing

• Accessed using iterators 

• Write to TTY 

• Dump to HTML file 

• Dump on: 

• Demand 

• Out of memory 

• Leak detection 

Memory Dump

[Screenshot: HTML memory dump (memdump.html) opened in a browser]

Summary

System (expandable sections): Overview, Virtual Memory, Physical Memory, Small Allocations, Medium Allocations, Large Allocations, Giant Allocations, Memory Tracing

Small Allocations (expanded):
  Memory Mapped            5312.00kB
  Mapped Memory Used       5312.00kB
  Memory Used              2855.76kB
  Peak Memory Used         2861.13kB
  Memory Wasted            3.18kB
  Allocation Count         3256
  Peak Allocation Count    37079

Allocators (expandable sections): Main, Debug, Render, RenderMapped, MainScratch, FileOp, Resource, ResourceTemp, ResourceIo, ResourceProcess, Scratch, DynDimTargetsVram, LuaVMPool, CmndProcessorVM, Physx, Popcorn, Wwise

FileOp (expanded):
  Memory Used              68.58kB
  Peak Memory Used         68.58kB
  Allocation Count         33
  Peak Allocation Count    33

Units: Bytes, kB, MB

Filters: Allocator, Type, Min Size, Max Size, Min Address, Max Address, Min Age (s), Max Age (s), Label, File (with a Clear Filter button)

Details

Address      Size      Type    Allocator  Age (s)  Label  File                 Line
0xB3400000   10922.50  Large   Main       16              lsx_render.cpp       1121
0xB1400000   10922.50  Large   Main       16              lsx_render.cpp       1085
0xB9400000   2048.00   Large   Main       16              lsx_entity.cpp       129
0xB7400000   2048.00   Large   Main       16              lsx_entity.cpp       129
0xA6400010   1353.68   Medium  Main       12              Gx/Gxc/GxcUtils.cpp  72
0xA4028060   1024.38   Medium  Main       16              lsx_transform.cpp    359
0xA14DF240   1024.00   Medium  Main       16              lsx_render.cpp       1102
0xA4200010   512.19    Medium  Main       16              lsx_transform.cpp    361
0xA41281F0   512.19    Medium  Main       16              lsx_transform.cpp    360
0xA42800E0   512.19    Medium  Main       16              lsx_transform.cpp    365
0xA4300200   256.13    Medium  Main       16              GfxoFont.cpp         105
0xA149F180   256.13    Medium  Main       27              GfxoFont.cpp         105

(Each row also has an expandable Callstack column.)

Memory Header Guards

• Free bytes in medium allocation headers 

• Detect memory stomps 

• Often too late 

• Easy to spot leet speak in memory view

• 0xA110C8

• 0xDE1E7E
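
Checking a guard word stored in spare header bytes might look like the sketch below. The two guard values are the ones on the slide; the header layout, the names, and the assumption that one value marks allocated blocks and the other free blocks are illustrative guesses.

```cpp
#include <cassert>
#include <cstdint>

// Guard values from the slide.
constexpr std::uint32_t kGuardAllocated = 0xA110C8;
constexpr std::uint32_t kGuardFree      = 0xDE1E7E;

// Hypothetical header for a doubly linked block list.
struct BlockHeaderSketch
{
    BlockHeaderSketch* prev;
    BlockHeaderSketch* next;
    std::uint32_t      size;
    std::uint32_t      guard;  // otherwise unused bytes in the header
};

enum class GuardState { Allocated, Free, Stomped };

// Anything other than the two known values means the header was overwritten.
inline GuardState CheckGuard(const BlockHeaderSketch& header)
{
    if (header.guard == kGuardAllocated)
        return GuardState::Allocated;
    if (header.guard == kGuardFree)
        return GuardState::Free;
    return GuardState::Stomped;
}

// Validate a header before trusting its links, e.g. on Deallocate.
inline void AssertNotStomped(const BlockHeaderSketch& header)
{
    assert(CheckGuard(header) != GuardState::Stomped);
}
```

The check only runs when the allocator next touches the header, which is why a stomp is often detected long after the offending write.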

Memory Block Sentinels

• Bypass normal allocators 

• Each allocation in own page 

• Unmapped pages before and after 

• Crash on over/under write 

Memory Protection Flags on PlayStation®4

• Allocs protected using memory protection flags 

• Specified by each allocator instance 

• Crash when CPU or GPU accesses wrong memory 

• Prevents 

• Stomps from CPU/GPU 

• Unintentional read/write using slow bus 

• Wasting page tables 

PlayStation®4 GPU Debugging


• Keep mapping table at fixed address 

• Stores bus and protection flags 

• Two-stage lookup table to save space 

• Renderer validates addresses before submit 

• Modify shaders on load 

• Check address before read/write 
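
A two-stage lookup table saves space because second-level blocks only exist for address ranges that are actually mapped. The sketch below is an assumption-based illustration: the name `PageFlagTableSketch`, the 12-bit split per level, the 64kB page granularity and the one-byte flag encoding are all hypothetical, and a flags value of 0 is taken to mean "unmapped".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr unsigned    kFlagPageShift   = 16;         // 64kB pages
constexpr unsigned    kFlagLeafBits    = 12;
constexpr std::size_t kFlagLeafEntries = 1u << 12;   // pages per leaf
constexpr std::size_t kFlagRootEntries = 1u << 12;   // covers 1TB in total

class PageFlagTableSketch
{
public:
    PageFlagTableSketch() : m_root(kFlagRootEntries) {}

    void Set(std::uint64_t address, std::uint8_t flags)
    {
        const std::uint64_t page = address >> kFlagPageShift;
        const std::size_t root = static_cast<std::size_t>(page >> kFlagLeafBits);
        assert(root < kFlagRootEntries);
        if (!m_root[root])
        {
            m_root[root].reset(new Leaf());  // second stage created on demand
            ++m_leafCount;
        }
        m_root[root]->flags[page & (kFlagLeafEntries - 1)] = flags;
    }

    // Returns 0 for any page that was never set.
    std::uint8_t Get(std::uint64_t address) const
    {
        const std::uint64_t page = address >> kFlagPageShift;
        const std::size_t root = static_cast<std::size_t>(page >> kFlagLeafBits);
        if (root >= kFlagRootEntries || !m_root[root])
            return 0;
        return m_root[root]->flags[page & (kFlagLeafEntries - 1)];
    }

    std::size_t LeafCount() const { return m_leafCount; }

private:
    struct Leaf { std::uint8_t flags[kFlagLeafEntries] = {}; };

    std::vector<std::unique_ptr<Leaf>> m_root;
    std::size_t m_leafCount = 0;
};
```

With this split, a flat one-byte-per-page table for 1TB would need 16MB, whereas the two-stage version costs one root array plus 4kB per populated 256MB region.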

Summary

• Modern consoles have rich virtual memory support 

• Virtual memory provides many options 

• Design your memory system around your allocation patterns 

• Analysis is important 

• Small allocations are a good place to start 

• Modularised allocators make customisation easy! 

• Debug features are vital! 

Questions?

aaron_macdougall@scee.net 


